{
 "cells": [
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "# 第21章 PageRank算法"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "1. PageRank是互联网网页重要度的计算方法，可以定义推广到任意有向图结点的重要度计算上。其基本思想是在有向图上定义随机游走模型，即一阶马尔可夫链，描述游走者沿着有向图随机访问各个结点的行为，在一定条件下，极限情况访问每个结点的概率收敛到平稳分布，这时各个结点的概率值就是其 PageRank值，表示结点相对重要度。\n",
    "\n",
    "2. 有向图上可以定义随机游走模型，即一阶马尔可夫链，其中结点表示状态，有向边表示状态之间的转移，假设一个结点到连接出的所有结点的转移概率相等。转移概率由转移矩阵$M$表示\n",
    "$$M = [ m _ { i j } ] _ { n \\times n }$$\n",
    "第$i$行第$j$列的元素$m _ { i j }$表示从结点$j$跳转到结点$i$的概率。\n",
    "\n",
    "3. 当含有$n$个结点的有向图是强连通且非周期性的有向图时，在其基础上定义的随机游走模型，即一阶马尔可夫链具有平稳分布，平稳分布向量$R$称为这个有向图的 PageRank。若矩阵$M$是马尔可夫链的转移矩阵，则向量R满足$$MR=R$$向量$R$的各个分量称 PageRank为各个结点的值。\n",
    "$$R = \\left[ \\begin{array} { c } { P R ( v _ { 1 } ) } \\\\ { P R ( v _ { 2 } ) } \\\\ { \\vdots } \\\\ { P R ( v _ { n } ) } \\end{array} \\right]$$\n",
    "其中$P R ( v _ { i } ) , i = 1,2 , \\cdots , n$，表示结点$v_i$的 PageRank值。这是 PageRank的基本定义。\n",
    "\n",
    "4. PageRank基本定义的条件现实中往往不能满足，对其进行扩展得到 PageRank的一般定义。任意含有$n$个结点的有向图上，可以定义一个随机游走模型，即一阶马尔可夫链，转移矩阵由两部分的线性组合组成，其中一部分按照转移矩阵$M$，从一个结点到连接出的所有结点的转移概率相等，另一部分按照完全随机转移矩阵，从任一结点到任一结点的转移概率都是$1/n$。这个马尔可夫链存在平稳分布，平稳分布向量R称为这个有 PageRank向图的一般，满足\n",
    "$$R = d M R + \\frac { 1 - d } { n } 1$$\n",
    "\n",
    "其中$d ( 0 \\leq d \\leq 1 )$是阻尼因子，1是所有分量为1的$n$维向量。\n",
    "\n",
    "5. PageRank的计算方法包括迭代算法、幂法、代数算法。\n",
    "\n",
    "幂法将 PageRank的等价式写成$$R = ( d M + \\frac { 1 - d } { n } E ) R = A R$$\n",
    "其中$d$是阻尼因子，$E$是所有元素为1的$n$阶方阵。\n",
    "\n",
    "PageRank算法可以看出$R$是一般转移矩阵$A$的主特征向量，即最大的特征值对应的特征向量。\n",
    "幂法就是一个计算矩阵的主特征值和主特征向量的方法。\n",
    "\n",
    "步骤是：选择初始向量$x_0$；计算一般转移矩阵$A$；进行迭代并规范化向量\n",
    "$$y _ { t + 1 } = A x _ { t }$$\n",
    "$$x _ { t + 1 } = \\frac { y _ { t + 1 } } { \\| y _ { t + 1 } \\| }$$\n",
    "直至收敛。"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {
    "colab_type": "text",
    "id": "iWOZV94kYsbM"
   },
   "source": [
    "---\n",
    "在实际应用中许多数据都以图(graph)的形式存在，比如，互联网、社交网络都可以看作是一个图。图数据上的机器学习具有理论与应用上的重要意义。pageRank算法是图的链接分析 (link analysis)的代表性算法，属于图数据上的无监督学习方法。  \n",
    "\n",
    "pageRank算法最初作为互联网网页重要度的计算方法，1996年由page和Brin提出，并用于谷歌搜索引擎的网页排序。事实上，pageRank可以定义在任意有向图上，后来被应用到社会影响力分析、文本摘要等多个问题。  \n",
    "\n",
    "pageRank算法的基本想法是在有向图上定义一个随机游走模型，即一阶马尔可夫链，描述随机游走者沿着有向图随机访问各个结点的行为。在一定条件下，极限情况访问每个结点的概率收敛到平稳分布， 这时各个结点的平稳概率值就是其 pageRank值，表示结点的重要度。 pageRank是递归定义的，pageRank的计算可以通过迭代算法进行。"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 1,
   "metadata": {
    "colab": {},
    "colab_type": "code",
    "id": "fAN4q0cqYn-f"
   },
   "outputs": [],
   "source": [
    "#https://gist.github.com/diogojc/1338222/84d767a68da711a154778fb1d00e772d65322187\n",
    "\n",
    "import numpy as np\n",
    "from scipy.sparse import csc_matrix\n",
    "\n",
    "\n",
    "def pageRank(G, s=.85, maxerr=.0001):\n",
    "    \"\"\"\n",
    "    Computes the pagerank for each of the n states\n",
    "    Parameters\n",
    "    ----------\n",
    "    G: matrix representing state transitions\n",
    "       Gij is a binary value representing a transition from state i to j.\n",
    "    s: probability of following a transition. 1-s probability of teleporting\n",
    "       to another state.\n",
    "    maxerr: if the sum of pageranks between iterations is bellow this we will\n",
    "            have converged.\n",
    "    \"\"\"\n",
    "    n = G.shape[0]\n",
    "\n",
    "    # transform G into markov matrix A\n",
    "    A = csc_matrix(G, dtype=np.float)\n",
    "    rsums = np.array(A.sum(1))[:, 0]\n",
    "    ri, ci = A.nonzero()\n",
    "    A.data /= rsums[ri]\n",
    "\n",
    "    # bool array of sink states\n",
    "    sink = rsums == 0\n",
    "\n",
    "    # Compute pagerank r until we converge\n",
    "    ro, r = np.zeros(n), np.ones(n)\n",
    "    while np.sum(np.abs(r - ro)) > maxerr:\n",
    "        ro = r.copy()\n",
    "        # calculate each pagerank at a time\n",
    "        for i in range(0, n):\n",
    "            # inlinks of state i\n",
    "            Ai = np.array(A[:, i].todense())[:, 0]\n",
    "            # account for sink states\n",
    "            Di = sink / float(n)\n",
    "            # account for teleportation to state i\n",
    "            Ei = np.ones(n) / float(n)\n",
    "\n",
    "            r[i] = ro.dot(Ai * s + Di * s + Ei * (1 - s))\n",
    "\n",
    "    # return normalized pagerank\n",
    "    return r / float(sum(r))"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 2,
   "metadata": {
    "colab": {
     "base_uri": "https://localhost:8080/",
     "height": 53
    },
    "colab_type": "code",
    "id": "Ds-wQEFFZ1F7",
    "outputId": "b2860902-8712-4583-ab47-bec602c6791b"
   },
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "[0.12727557 0.03616954 0.12221594 0.22608452 0.28934412 0.03616954\n",
      " 0.16274076]\n"
     ]
    }
   ],
   "source": [
    "# Example extracted from 'Introduction to Information Retrieval'\n",
    "G = np.array([[0,0,1,0,0,0,0],\n",
    "              [0,1,1,0,0,0,0],\n",
    "              [1,0,1,1,0,0,0],\n",
    "              [0,0,0,1,1,0,0],\n",
    "              [0,0,0,0,0,0,1],\n",
    "              [0,0,0,0,0,1,1],\n",
    "              [0,0,0,1,1,0,1]])\n",
    "print(pageRank(G,s=.86))"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "----\n",
    "本章代码来源：https://github.com/hktxt/Learn-Statistical-Learning-Method\n",
    "\n",
    "本文代码更新地址：https://github.com/fengdu78/lihang-code\n",
    "\n",
    "中文注释制作：机器学习初学者公众号：ID:ai-start-com\n",
    "\n",
    "配置环境：python 3.5+\n",
    "\n",
    "代码全部测试通过。\n",
    "![gongzhong](../gongzhong.jpg)"
   ]
  }
 ],
 "metadata": {
  "colab": {
   "collapsed_sections": [],
   "name": "PageRank.ipynb",
   "provenance": [],
   "version": "0.3.2"
  },
  "kernelspec": {
   "display_name": "Python 3",
   "language": "python",
   "name": "python3"
  },
  "language_info": {
   "codemirror_mode": {
    "name": "ipython",
    "version": 3
   },
   "file_extension": ".py",
   "mimetype": "text/x-python",
   "name": "python",
   "nbconvert_exporter": "python",
   "pygments_lexer": "ipython3",
   "version": "3.7.6"
  }
 },
 "nbformat": 4,
 "nbformat_minor": 1
}
